Section 2 - Principal Components Analysis

PCA is a linear transformation:

\begin{equation*} Z = XQ \end{equation*}

If a vector is multiplied by a scalar, its direction stays the same; however, its length can change.

If a vector is multiplied by a matrix, it can "point" in a new direction (it gets rotated).

Input data X is of shape (N, D).

The transformation matrix Q is of shape (D, D).

The transformed data Z is of shape (N, D).

Multiplying a vector by a matrix can be seen as rotating the coordinate system of the vector.
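As a quick illustration (a minimal sketch with made-up data; the rotation matrix Q below is just one example of an orthonormal transformation), the shapes work out as follows:

import numpy as np

# Made-up data: N = 5 samples, D = 3 dimensions
N, D = 5, 3
X = np.random.randn(N, D)

# Example orthonormal Q: a rotation of the first two coordinates by 45 degrees
theta = np.pi / 4
Q = np.array([[np.cos(theta), -np.sin(theta), 0],
              [np.sin(theta),  np.cos(theta), 0],
              [0,              0,             1]])

Z = X @ Q          # transformed data
print(Z.shape)     # (5, 3) -- same shape as X, expressed in rotated coordinates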

PCA is often used for dimensionality reduction.

The information carried by a variable can be assessed by measuring its variance.

A variable having variance equal to zero provides no information.

Another advantage of PCA is that it de-correlates the variables.

High correlation between two variables means that one of them can be disregarded, since they provide (approximately) the same information.
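A hedged illustration on synthetic data: np.corrcoef measures how strongly two variables move together, and a correlation close to 1 suggests the second variable is nearly redundant.

import numpy as np

rng = np.random.default_rng(0)
a = rng.normal(size=1000)
b = 2 * a + 0.05 * rng.normal(size=1000)   # b is almost a scaled copy of a
c = rng.normal(size=1000)                  # c is unrelated to a

print(np.corrcoef(a, b)[0, 1])   # close to 1 -> b adds little new information
print(np.corrcoef(a, c)[0, 1])   # close to 0 -> c carries genuinely new information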

The goal of PCA is to find a matrix Q that removes the correlations between the variables, so that the most crucial information is concentrated in the smallest possible number of dimensions.

PCA can help reduce the impact of noise.

Assuming that the noise is not a major factor in the data, it is fair to state that the principal components capturing the noise will be somewhere down the list. Thus, truncating the PCA transformation acts as a denoising pre-processing step.

Furthermore, by reducing the number of parameters, the risk of fitting the model to the noise (overfitting) is significantly reduced.

The latent variables Z can be interpreted as the underlying causes of X.

That explains why they are (largely) uncorrelated.

Very important: uncorrelated does not necessarily mean independent, unless the distribution is Gaussian.
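A small synthetic example of that distinction (for a standard normal x, x and x^2 are uncorrelated even though x^2 is completely determined by x):

import numpy as np

rng = np.random.default_rng(5)
x = rng.normal(size=100_000)
y = x ** 2                       # fully determined by x, so clearly not independent

print(np.corrcoef(x, y)[0, 1])   # close to 0: uncorrelated, yet dependent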

In PCA, the linear transformation can be applied both ways:

\begin{equation*} Z = XQ \end{equation*}

\begin{equation*} X = ZQ^{-1} \end{equation*}
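A minimal round-trip sketch, assuming Q is invertible (which holds for the orthonormal Q used by PCA, where the inverse is simply the transpose):

import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(10, 3))
Q = np.linalg.qr(rng.normal(size=(3, 3)))[0]   # any orthonormal Q will do

Z = X @ Q                        # forward transform
X_back = Z @ np.linalg.inv(Q)    # inverse transform (equals Z @ Q.T here)
print(np.allclose(X, X_back))    # True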

PCA Derivation

Covariance of X is given as: \begin{equation*} Cov(X)= \frac{1}{N} (X - \mu _{X})^{T} (X - \mu _{X}) \end{equation*}

Cov(X) is a D by D matrix.
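The same formula written out in NumPy (a sketch on random data; note that np.cov divides by N - 1 by default, so bias=True is passed to match the 1/N version above):

import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(100, 3))                 # shape (N, D)

mu = X.mean(axis=0)                           # per-dimension means, shape (D,)
cov_manual = (X - mu).T @ (X - mu) / X.shape[0]

cov_np = np.cov(X, rowvar=False, bias=True)   # rowvar=False: variables are columns
print(np.allclose(cov_manual, cov_np))        # True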

Eigenvalues and Eigenvectors

Eigenvectors (v) are non-zero vectors that only change by a scalar factor when a linear transformation is applied to them.

\begin{equation*} Av = \lambda v \end{equation*}

The equation can be re-written as:

\begin{equation*} (A - \lambda I) v = 0 \end{equation*}

It has a non-zero solution if and only if the determinant is zero: \begin{equation*} \det( A - \lambda I ) = 0 \end{equation*}

λ is known as the eigenvalue, or characteristic root, associated with the eigenvector v.

The goal is to find the eigenvalues and their corresponding eigenvectors.

There will be D eigenvalues (because Cov(X) is of shape D by D), and since a covariance matrix is positive semi-definite, their values will be greater than or equal to zero.
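A minimal numerical check of the eigenvalue equation, using a small symmetric matrix as a stand-in for Cov(X):

import numpy as np

A = np.array([[2.0, 1.0],
              [1.0, 2.0]])               # symmetric, like a covariance matrix

eig_vals, eig_vecs = np.linalg.eig(A)    # columns of eig_vecs are the eigenvectors
for i in range(len(eig_vals)):
    v = eig_vecs[:, i]
    print(np.allclose(A @ v, eig_vals[i] * v))   # True: A v = lambda v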

After finding the eigenvectors and their corresponding eigenvalues, the eigenvalues are sorted in descending order and collected into a diagonal matrix:

\begin{equation*} \Lambda = \begin{pmatrix} \lambda_{1} & 0 & \cdots & 0\\ 0 & \lambda_{2} & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & \lambda_{D} \end{pmatrix} \end{equation*}

The matrix V is constructed by stacking the eigenvectors as columns: \begin{equation*} V = \begin{pmatrix} | & | & & | \\ v_{1} & v_{2} & \cdots & v_{D}\\ | & | & & | \end{pmatrix} \end{equation*}

The matrix V is orthonormal, which means that the dot product of two different eigenvectors is zero and each eigenvector has unit length, so V^{T}V = I.

Writing the eigenvalue equation for all eigenvectors of Cov(X) at once: \begin{equation*} Cov(X)V = V \Lambda \end{equation*}

Furthermore, covariance of Z is given as: \begin{equation*} Cov(Z)= \frac{1}{N} (Z - \mu _{Z})^{T} (Z - \mu _{Z}) = \frac{1}{N} (XQ - \mu _{X}Q)^{T} (XQ - \mu _{X}Q) = \frac{1}{N} Q^{T}(X - \mu _{X})^{T} (X - \mu _{X})Q = Q^{T} Cov(X) Q \end{equation*}

By choosing Q = V we obtain: \begin{equation*} Cov(Z) = V^{T} Cov(X) V = V^{T} V \Lambda = \Lambda \end{equation*}

The off-diagonal elements are equal to zero, which means that no dimension is correlated with any other dimension. The values of λ were sorted in descending order, so the first column of Z has the largest variance and thus carries the most information.
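A quick numerical check of this result on synthetic correlated data (np.linalg.eigh is used because Cov(X) is symmetric; its eigenvalues come back in ascending order, so they are re-sorted to descending):

import numpy as np

rng = np.random.default_rng(3)
X = rng.normal(size=(500, 4)) @ rng.normal(size=(4, 4))   # correlated columns
cov_X = np.cov(X, rowvar=False)

eig_vals, V = np.linalg.eigh(cov_X)
order = np.argsort(eig_vals)[::-1]        # descending eigenvalue order
V = V[:, order]

Z = X @ V
cov_Z = np.cov(Z, rowvar=False)
print(np.round(cov_Z, 6))                 # (approximately) diagonal, variances sorted descending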

PCA Objective Function

PCA minimizes the reconstruction objective function, given as: \begin{equation*} J = \sum_{n=1}^{N} | x(n) - x_{reconstructed}(n) |^{2} \end{equation*}

Perfect reconstruction (using all D components) leads to zero error, because: \begin{equation*} X_{reconstructed} = X Q Q^{-1} = X \end{equation*} since: \begin{equation*} Q Q^{-1} = I \end{equation*}

Furthermore, Q is orthonormal so: \begin{equation*} Q^{-1} = Q^{T} \end{equation*}

Finally, we can re-state the problem as: \begin{equation*} X_{reconstructed} = X Q Q^{T} \end{equation*}

Reconstruction

The data can be reconstructed from only K < D eigenvectors, leading to a non-zero reconstruction error. That is the trade-off for lower dimensionality.

\begin{equation*} X_{reconstructed} = X Q_{K} Q_{K}^{T} \end{equation*}

In that scenario the objective function is re-written as (the right-hand side is the squared Frobenius norm): \begin{equation*} J = \sum_{n=1}^{N} | x(n) - Q_{K} Q_{K}^{T} x(n) |^2 = \| X - XQ_{K} Q_{K}^{T} \|_{F}^2 \end{equation*}
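A sketch of the truncated reconstruction and its error on synthetic, centred data (Q_K here is simply the first K columns of the eigenvector matrix sorted by descending eigenvalue):

import numpy as np

rng = np.random.default_rng(4)
X = rng.normal(size=(500, 4)) @ rng.normal(size=(4, 4))
X = X - X.mean(axis=0)                      # centre the data

eig_vals, V = np.linalg.eigh(np.cov(X, rowvar=False))
V = V[:, np.argsort(eig_vals)[::-1]]        # sort eigenvectors by descending eigenvalue

K = 2
Q_K = V[:, :K]                              # keep the top K components
X_rec = X @ Q_K @ Q_K.T                     # reconstruction from K components

J = np.linalg.norm(X - X_rec, 'fro') ** 2   # squared Frobenius norm of the residual
print(J)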

Implementation from scratch

Based on Sebastian Raschka's blog post: http://sebastianraschka.com/Articles/2014_pca_step_by_step.html

PCA step by step

1) Load a dataset without labels. The data will be d-dimensional.

2) Calculate the mean of every dimension across all samples in the dataset.

3) Compute the covariance matrix for the whole dataset.

4) Compute the eigenvectors (denoted e_1, e_2, ..., e_d) and the corresponding eigenvalues (lambda_1, lambda_2, ..., lambda_d).

5) Sort the eigenvectors by their eigenvalues (descending) and choose the k with the largest eigenvalues.

6) Form a d by k matrix W, where every column represents an eigenvector.

7) Use the matrix W to transform the data onto the new subspace:

\begin{equation*} y = W^{T} x \end{equation*}

where:

x is a d-by-1 vector representing one sample, and y is the transformed k-by-1 sample in the new subspace.

Generate the data

A 3-dimensional dataset drawn from a multivariate Gaussian distribution.

More information can be found:

https://faculty.math.illinois.edu/~r-ash/Stat/StatLec21-25.pdf


In [1]:
import numpy as np
np.random.seed(123456789)

In [2]:
# Number of samples per class
samples_per_class = 200

In [3]:
# Generate the first "class" samples
# Mean
mu_vec1 = np.array([0, 0, 0])

# Covariance
cov_mat1 = np.array([[1,0,0],[0,1,0],[0,0,1]])

# Samples
class1_sample = np.random.multivariate_normal(mu_vec1, cov_mat1, samples_per_class).T

In [4]:
class1_sample.shape


Out[4]:
(3, 200)

In [5]:
# Generate the second "class" samples
# Mean
mu_vec2 = np.array([1, 1, 1])

# Covariance
cov_mat2 = np.array([[1,0,0],[0,1,0],[0,0,1]])

# Samples
class2_sample = np.random.multivariate_normal(mu_vec2, cov_mat2, samples_per_class).T

In [6]:
class2_sample.shape


Out[6]:
(3, 200)

In [7]:
# Visualizing the data
from matplotlib import pyplot as plt
from mpl_toolkits.mplot3d import Axes3D
from mpl_toolkits.mplot3d import proj3d
%pylab inline


Populating the interactive namespace from numpy and matplotlib

In [8]:
fig = plt.figure(figsize = (12, 12))
ax = fig.add_subplot(111, projection = '3d')
plt.rcParams['legend.fontsize'] = 10   
ax.plot(class1_sample[0,:], class1_sample[1,:], class1_sample[2,:], 'o', 
        markersize = 8, color = 'blue', alpha = 0.5, label = 'class1')
ax.plot(class2_sample[0,:], class2_sample[1,:], class2_sample[2,:], '^', 
        markersize = 8, alpha = 0.5, color = 'red', label = 'class2')

plt.title('Samples for class 1 and class 2')
ax.legend(loc = 'upper right')


Out[8]:
<matplotlib.legend.Legend at 0x2721add4e80>

In [9]:
# Gathering the dataset together
all_samples = np.concatenate((class1_sample, class2_sample), 
                             axis = 1)

In [10]:
# Computing the d-dimensional mean vector
mean_x = np.mean(all_samples[0,:])
mean_y = np.mean(all_samples[1,:])
mean_z = np.mean(all_samples[2,:])

mean_vector = np.array([[mean_x],[mean_y],[mean_z]])
mean_vector


Out[10]:
array([[0.4890709 ],
       [0.58096305],
       [0.40024479]])

In [11]:
# Computing the Scatter Matrix
# Intializing the scatter matrix as zeros
# It will be of size 3 by 3 in this specific example
dimensionality = all_samples.shape[0]

scatter_matrix = np.zeros((dimensionality, dimensionality))
for i in range(all_samples.shape[1]):
    scatter_matrix += (all_samples[:,i].reshape(dimensionality,1) - mean_vector).dot((all_samples[:,i].reshape(dimensionality,1) - mean_vector).T)
    
scatter_matrix


Out[11]:
array([[499.4167083 , 135.24138206,  89.9976988 ],
       [135.24138206, 438.61830474, 102.19588451],
       [ 89.9976988 , 102.19588451, 476.82651834]])

In [12]:
# Computing the Covariance Matrix
cov_mat = np.cov([all_samples[0,:], all_samples[1,:], all_samples[2,:]])
cov_mat


Out[12]:
array([[1.25167095, 0.33895083, 0.22555814],
       [0.33895083, 1.099294  , 0.25613004],
       [0.22555814, 0.25613004, 1.19505393]])

In [13]:
# Calculating Eigenvectors and Eigenvalues from the scatter matrix
eig_val_sc, eig_vec_sc = np.linalg.eig(scatter_matrix)
for i in range(len(eig_val_sc)):
    eigvec_sc = eig_vec_sc[:,i].reshape(1, dimensionality).T

In [14]:
eig_val_sc


Out[14]:
array([691.88036512, 398.80615223, 324.17501403])

In [15]:
eig_vec_sc


Out[15]:
array([[-0.63913277, -0.57920898,  0.50599038],
       [-0.55580333, -0.10689217, -0.82441295],
       [-0.5315938 ,  0.80814047,  0.25360799]])

In [16]:
# Calculating Eigenvectors and Eigenvalues from the covariance matrix
eig_val_cov, eig_vec_cov = np.linalg.eig(cov_mat)
for i in range(len(eig_val_cov)):
    eigvec_cov = eig_vec_cov[:,i].reshape(1, dimensionality).T

In [17]:
eig_val_cov


Out[17]:
array([1.734036  , 0.99951417, 0.81246871])

In [18]:
eig_vec_cov


Out[18]:
array([[-0.63913277, -0.57920898,  0.50599038],
       [-0.55580333, -0.10689217, -0.82441295],
       [-0.5315938 ,  0.80814047,  0.25360799]])

The eigenvectors are the same, but the eigenvalues differ by a constant factor: the scatter matrix is (N - 1) times the covariance matrix, because np.cov divides by N - 1.


In [19]:
# Pairing eigenvalues with their eigenvectors before sorting
# Creating tuples of the form (eigenvalue, eigenvector)
eig_pairs = [(np.abs(eig_val_sc[i]), eig_vec_sc[:,i]) for i in range(len(eig_val_sc))]
eig_pairs


Out[19]:
[(691.88036512278, array([-0.63913277, -0.55580333, -0.5315938 ])),
 (398.80615222880954, array([-0.57920898, -0.10689217,  0.80814047])),
 (324.1750140261993, array([ 0.50599038, -0.82441295,  0.25360799]))]

In [20]:
# Sort the (eigenvalue, eigenvector) tuples from high to low
eig_pairs.sort(key=lambda x: x[0], reverse = True)
eig_pairs


Out[20]:
[(691.88036512278, array([-0.63913277, -0.55580333, -0.5315938 ])),
 (398.80615222880954, array([-0.57920898, -0.10689217,  0.80814047])),
 (324.1750140261993, array([ 0.50599038, -0.82441295,  0.25360799]))]

In [21]:
# Choosing two eigenvectors with the largest eigenvalues
k_number_of_dimensions = 2
top_k_pairs = eig_pairs[:k_number_of_dimensions]
matrix_w = np.hstack((top_k_pairs[0][1].reshape(dimensionality,1), top_k_pairs[1][1].reshape(dimensionality,1)))

In [22]:
# Transforming the samples onto the new subspace (2 - dimensional)
transformed = matrix_w.T.dot(all_samples)

In [23]:
plt.figure(figsize = (12, 12))
plt.plot(transformed[0, 0:samples_per_class], transformed[1,0:samples_per_class], 
         'o', 
         markersize = 7, 
         color = 'blue', 
         alpha = 0.5, 
         label = 'class1')
plt.plot(transformed[0, samples_per_class:], transformed[1,samples_per_class :], 
         '^', 
         markersize = 7, 
         color = 'red', 
         alpha = 0.5, 
         label = 'class2')
plt.xlim([-4, 4])
plt.ylim([-4, 4])
plt.xlabel('x_values')
plt.ylabel('y_values')
plt.legend()
plt.title('Transformed samples with class labels')

plt.show()


Sklearn implementation

Loading the data


In [24]:
import pandas as pd
import numpy as np
from sklearn.utils import shuffle

In [25]:
def getKaggleMNIST():
    # Column 0 is the label
    # Columns 1-784 are the pixel data, with values 0..255
    # .as_matrix() was removed in newer pandas; .to_numpy() is the current equivalent
    train = pd.read_csv('./data/Section 2/train.csv').to_numpy().astype(np.float32)
    train = shuffle(train)

    Xtrain = train[:-1000, 1:] / 255
    Ytrain = train[:-1000, 0].astype(np.int32)

    Xtest  = train[-1000:, 1:] / 255
    Ytest  = train[-1000:, 0].astype(np.int32)
    return Xtrain, Ytrain, Xtest, Ytest

In [26]:
Xtrain, Ytrain, Xtest, Ytest = getKaggleMNIST()

In [27]:
Xtrain.shape


Out[27]:
(41000, 784)

PCA from Sklearn


In [28]:
from sklearn.decomposition import PCA

In [29]:
pca = PCA()
reduced = pca.fit_transform(Xtrain)

In [30]:
reduced.shape


Out[30]:
(41000, 784)

In [31]:
# Variance of each column
pd.DataFrame(data=reduced).var()


Out[31]:
0      5.140932e+00
1      3.785782e+00
2      3.248402e+00
3      2.840127e+00
4      2.584887e+00
5      2.274046e+00
6      1.730318e+00
7      1.525671e+00
8      1.460173e+00
9      1.241773e+00
10     1.109470e+00
11     1.086073e+00
12     8.993473e-01
13     8.937030e-01
14     8.351696e-01
15     7.827291e-01
16     6.973970e-01
17     6.776641e-01
18     6.260919e-01
19     6.087995e-01
20     5.672881e-01
21     5.362141e-01
22     5.085567e-01
23     4.820611e-01
24     4.682386e-01
25     4.431360e-01
26     4.280157e-01
27     4.102884e-01
28     3.917249e-01
29     3.622263e-01
           ...     
754    2.132367e-32
755    2.132174e-32
756    2.132378e-32
757    2.132417e-32
758    2.132414e-32
759    2.132420e-32
760    2.132300e-32
761    2.132392e-32
762    2.132355e-32
763    2.132405e-32
764    2.132379e-32
765    2.132368e-32
766    2.132413e-32
767    2.132416e-32
768    2.132421e-32
769    2.132378e-32
770    2.132237e-32
771    2.132421e-32
772    2.132339e-32
773    2.132302e-32
774    2.132385e-32
775    2.132383e-32
776    2.132386e-32
777    2.132342e-32
778    2.130706e-32
779    2.113551e-32
780    1.628526e-32
781    6.442654e-33
782    2.321912e-33
783    2.815062e-34
Length: 784, dtype: float64

In [32]:
import matplotlib.pyplot as plt
plt.figure(figsize = (12, 12))
plt.scatter(reduced[:, 0], 
            reduced[:, 1], 
            s = 100, 
            c = Ytrain, 
            alpha = 0.5)
plt.show()



In [33]:
plt.figure(figsize = (12, 12))
plt.plot(pca.explained_variance_ratio_)
plt.show()



In [34]:
# In order not to lose too much information, the number of dimensions the data is reduced to
# should be chosen so that the cumulative explained variance ratio is above 95%.
cumulative = []
last = 0
for v in pca.explained_variance_ratio_:
    cumulative.append(last + v)
    last = cumulative[-1]
plt.figure(figsize = (12, 12))
plt.plot(cumulative)
plt.show()
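The same cumulative curve can be obtained with np.cumsum, and (as an alternative sketch, assuming the pca and Xtrain objects from the cells above) sklearn's PCA also accepts a float n_components, in which case it keeps just enough components to reach that fraction of the variance:

import numpy as np
from sklearn.decomposition import PCA

cumulative = np.cumsum(pca.explained_variance_ratio_)   # same values as the loop above
print(np.argmax(cumulative >= 0.95) + 1)                 # number of components needed for >= 95%

# Equivalent: let sklearn pick the number of components that keeps 95% of the variance
pca_95 = PCA(n_components=0.95)
reduced_95 = pca_95.fit_transform(Xtrain)
print(reduced_95.shape)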